Purpose

The purpose of this document is to illustrate time series analysis and forecasting. We will use a simulated dataset to analyze metrics such as visits, discharges, and payments. To perform these analyses we will follow the modeltime workflow, and this report is broken into sections that mirror that workflow.

Data View

Let's take a look at our data and see what it contains.

df_tbl %>%
  glimpse()
## Rows: 25,279
## Columns: 12
## $ mrn                     <chr> "66727914", "84487881", "68427598", "39652414"~
## $ visit_id                <chr> "1283065398", "1171004549", "1951203647", "149~
## $ visit_start_date_time   <dttm> 2011-12-26 01:14:00, 2011-12-31 05:00:00, 201~
## $ visit_end_date_time     <dttm> 2012-01-01 12:06:00, 2012-01-01 12:51:00, 201~
## $ total_charge_amount     <dbl> 62580.61, 38466.48, 31758.50, 14699.61, 66096.~
## $ total_adjustment_amount <dbl> -39117.58, -26930.67, -23706.23, -10841.80, -7~
## $ total_payment_amount    <dbl> -23463.03, -11535.81, -8052.27, -3857.81, -587~
## $ payer_grouping          <chr> "Commercial", "Blue Cross", "Blue Cross", "Blu~
## $ service_line            <chr> "Medical", "Surgical", "Medical", "Chest Pain"~
## $ ip_op_flag              <chr> "I", "I", "I", "I", "I", "O", "I", "I", "O", "~
## $ adm_date                <date> 2011-12-26, 2011-12-31, 2011-12-28, 2011-12-3~
## $ dsch_date               <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-0~
skim(df_tbl)
Data summary
Name df_tbl
Number of rows 25279
Number of columns 12
_______________________
Column type frequency:
character 5
Date 2
numeric 3
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
mrn 0 1 8 8 0 16789 0
visit_id 0 1 10 10 0 25279 0
payer_grouping 0 1 10 10 0 2 0
service_line 0 1 2 44 0 27 0
ip_op_flag 0 1 1 1 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
adm_date 0 1 2011-12-19 2019-12-31 2015-05-27 2916
dsch_date 0 1 2012-01-01 2019-12-31 2015-05-29 2887

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_charge_amount 0 1.00 34260.35 48285.83 0.5 10847.83 19475.11 39463.87 1109001.99 ▇▁▁▁▁
total_adjustment_amount 0 1.00 -22550.46 36053.24 -914728.0 -25220.65 -11619.57 -6951.65 63627.62 ▁▁▁▁▇
total_payment_amount 586 0.98 -11584.37 18165.44 -495119.8 -13132.33 -5999.98 -2920.08 436.01 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
visit_start_date_time 0 1 2011-12-19 21:33:00 2019-12-31 05:00:00 2015-05-27 01:55:00 16731
visit_end_date_time 0 1 2012-01-01 12:06:00 2019-12-31 22:58:00 2015-05-29 00:00:00 15184

Preparing Data

Our objective is a grouped forecast: we will forecast the number of discharges by inpatient/outpatient visit type and by payer grouping, aggregated to a weekly scale.

Aggregate discharges by IP/OP and Payer Grouping by Week

  1. Start with df_tbl
  2. Use summarise_by_time() with .by = "week" and count the visits with n().
  3. Save as a new variable called transactions_weekly_tbl
transactions_weekly_tbl <- df_tbl %>%
  filter(payer_grouping != "?") %>%
  group_by(ip_op_flag, payer_grouping) %>%
  summarise_by_time(
    .date_var = dsch_date
    , .by     = "week"
    , value   = n()
  )

transactions_weekly_tbl
## # A tibble: 1,667 x 4
## # Groups:   ip_op_flag, payer_grouping [4]
##    ip_op_flag payer_grouping dsch_date  value
##    <chr>      <chr>          <date>     <int>
##  1 I          Blue Cross     2012-01-01    37
##  2 I          Blue Cross     2012-01-08    31
##  3 I          Blue Cross     2012-01-15    36
##  4 I          Blue Cross     2012-01-22    32
##  5 I          Blue Cross     2012-01-29    40
##  6 I          Blue Cross     2012-02-05    38
##  7 I          Blue Cross     2012-02-12    31
##  8 I          Blue Cross     2012-02-19    37
##  9 I          Blue Cross     2012-02-26    38
## 10 I          Blue Cross     2012-03-04    31
## # ... with 1,657 more rows
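Note that summarise_by_time() only returns weeks that actually contain discharges, so it is worth confirming there are no gaps in each group's weekly series before visualizing. A quick check using timetk's pad_by_time() (the .pad_value = 0 fill is an assumption that a missing week means zero discharges):

```r
# Pad each group to a complete weekly sequence, then inspect any filled weeks
transactions_weekly_tbl %>%
  pad_by_time(
    .date_var    = dsch_date
    , .by        = "week"
    , .pad_value = 0
  ) %>%
  filter(value == 0)   # rows here are weeks with no recorded discharges
```

If this returns zero rows, the weekly series is already complete and no padding is needed.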

Visualizations

Visualize Discharges

Use plot_time_series() to visualize the discharges.

  • Look for outliers & any data issues
  • Try out a log() transformation to see the effect on the time series
transactions_weekly_tbl %>%
  plot_time_series(
    .date_var     = dsch_date
    , .color_var  = payer_grouping
    , .facet_vars = payer_grouping
    , .facet_ncol = 2
    , .value      = log(value)
    , .smooth     = FALSE
  )

Visualize ACF

Visualize the ACF using plot_acf_diagnostics() using a log() transformation. Look for:

  • Any frequencies we can include?
  • Any lags we can include? (Hint - What is our forecast horizon?)
transactions_weekly_tbl %>%
  ungroup() %>%
  plot_acf_diagnostics(dsch_date, log(value))

Log-Standardize Discharges (Target)

  • Start with transactions_weekly_tbl
  • Apply log-standardization:
    • Apply Log transformation using log()
    • Apply standardization to mean = 0, sd = 1 using standardize_vec()
  • Store the resulting data as transactions_trans_weekly_tbl
transactions_trans_weekly_tbl <- transactions_weekly_tbl %>%
  mutate(value = log(value)) %>%
  mutate(value = standardize_vec(value))
## Standardization Parameters
## mean: 3.08875144281386
## standard deviation: 0.367674566335952
## Standardization Parameters
## mean: 1.83577890003612
## standard deviation: 0.545791389303644
## Standardization Parameters
## mean: 3.15330156564258
## standard deviation: 0.302421031976675
## Standardization Parameters
## mean: 1.59951348649452
## standard deviation: 0.514947645076106
# Save the printed parameters (one pair per group, in the order printed above)
# so the transformation can be inverted later
mean_a <- 3.08875144281386
sd_a   <- 0.367674566335952
mean_b <- 1.83577890003612
sd_b   <- 0.545791389303644
mean_c <- 3.15330156564258
sd_c   <- 0.302421031976675
mean_d <- 1.59951348649452
sd_d   <- 0.514947645076106
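These saved parameters let us return forecasts to the original discharge scale later: undo the standardization first, then the log. A minimal sketch using timetk's standardize_inv_vec(), shown for the first group's parameters (forecast_values is a hypothetical vector of predictions on the transformed scale):

```r
# Invert the log-standardization for one group:
# standardize_inv_vec() undoes the centering/scaling, exp() undoes the log
original_scale <- forecast_values %>%
  standardize_inv_vec(mean = mean_a, sd = sd_a) %>%
  exp()
```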

Visualize the log-standardized transactions using plot_time_series(). This confirms the transformation was performed successfully.

transactions_trans_weekly_tbl %>%
      plot_time_series(
    .date_var     = dsch_date
    , .color_var  = payer_grouping
    , .facet_vars = payer_grouping
    , .facet_ncol = 2
    , .value      = value
    , .smooth     = FALSE
  )

To create our “full dataset” we’ll use the following parameters: a 14-week forecast horizon, a 14-week lag period, and a few rolling averages at various window sizes.

horizon         <- 14
lag_period      <- 14
rolling_periods <- c(7, 14, 28, 52)

Prepare the full data

  1. Start with transactions_weekly_tbl
  2. Add the future window: Use bind_rows() and future_frame() to extend the data frame .length_out = horizon.
  3. Add autocorrelated lags: Use tk_augment_lags() to add a .lags = lag_period
  4. Add rolling features from our lag: Use tk_augment_slidify() to add .period = rolling_periods. Use mean as the rolling function. Make sure to “center” with “partial” windows.
  5. Rename any columns that contain “lag”. Modify to start with “lag_”
  6. Save the output as full_tbl.
full_tbl <- transactions_trans_weekly_tbl %>%
    
    # Add future window
    bind_rows(
        future_frame(
          .data         = .
          , .date_var   = dsch_date
          , .length_out = horizon
        )
    ) %>%
    
    # Add autocorrelated lags
    tk_augment_lags(value, .lags = lag_period) %>%
    
    # Add rolling features
    tk_augment_slidify(
        .value   = value_lag14,
        .f       = mean, 
        .period  = rolling_periods,
        .align   = "center",
        .partial = TRUE
    ) %>%
    
    # Rename columns
    # Note: contains("lag") also matches "ip_op_flag", which is why that
    # column appears as "lag_ip_op_flag" in the output below
    rename_with(
      .cols = contains("lag")
      , .fn = ~ str_c("lag_", .)
    ) %>%
  select(dsch_date, everything())

full_tbl %>% 
  glimpse()
## Rows: 1,723
## Columns: 9
## Groups: lag_ip_op_flag, payer_grouping [4]
## $ dsch_date               <date> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-2~
## $ lag_ip_op_flag          <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "~
## $ payer_grouping          <chr> "Blue Cross", "Blue Cross", "Blue Cross", "Blu~
## $ value                   <dbl> 1.4201865, 0.9389710, 1.3456669, 1.0253210, 1.~
## $ lag_value_lag14         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_7  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_28 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_52 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~

Visualize the Full Data

Visualize the features, and review what you see.

  1. Start with full_tbl
  2. pivot_longer every column except the date and grouping columns (“dsch_date”, “lag_ip_op_flag”, “payer_grouping”)
  3. Use plot_time_series() to visualize the time series coloring by “name”.

Review the visualization selecting one feature at a time and answering the following questions:

- Do the rolling lags present any issues?
- Which rolling lag captures the trend the best?
full_tbl %>%
  pivot_longer(cols = -c(dsch_date, lag_ip_op_flag, payer_grouping)) %>%
  plot_time_series(
    dsch_date
    , value
    , name
    , .smooth = FALSE
    , .facet_ncol = 2
  )

Model Data / Forecast Data Split

Create a data_prepared_tbl by filtering full_tbl where “value” is non-missing.

data_prepared_tbl <- full_tbl %>%
    filter(!is.na(value))
data_prepared_tbl
## # A tibble: 1,667 x 9
## # Groups:   lag_ip_op_flag, payer_grouping [4]
##    dsch_date  lag_ip_op_flag payer_grouping value lag_value_lag14
##    <date>     <chr>          <chr>          <dbl>           <dbl>
##  1 2012-01-01 I              Blue Cross     1.42               NA
##  2 2012-01-08 I              Blue Cross     0.939              NA
##  3 2012-01-15 I              Blue Cross     1.35               NA
##  4 2012-01-22 I              Blue Cross     1.03               NA
##  5 2012-01-29 I              Blue Cross     1.63               NA
##  6 2012-02-05 I              Blue Cross     1.49               NA
##  7 2012-02-12 I              Blue Cross     0.939              NA
##  8 2012-02-19 I              Blue Cross     1.42               NA
##  9 2012-02-26 I              Blue Cross     1.49               NA
## 10 2012-03-04 I              Blue Cross     0.939              NA
## # ... with 1,657 more rows, and 4 more variables: lag_value_lag14_roll_7 <dbl>,
## #   lag_value_lag14_roll_14 <dbl>, lag_value_lag14_roll_28 <dbl>,
## #   lag_value_lag14_roll_52 <dbl>

Create a forecast_tbl by filtering full_tbl where “value” is missing.

forecast_tbl <- full_tbl %>%
    filter(is.na(value))
forecast_tbl
## # A tibble: 56 x 9
## # Groups:   lag_ip_op_flag, payer_grouping [4]
##    dsch_date  lag_ip_op_flag payer_grouping value lag_value_lag14
##    <date>     <chr>          <chr>          <dbl>           <dbl>
##  1 2020-01-05 I              Blue Cross        NA          -1.22 
##  2 2020-01-12 I              Blue Cross        NA          -0.393
##  3 2020-01-19 I              Blue Cross        NA          -0.860
##  4 2020-01-26 I              Blue Cross        NA          -0.695
##  5 2020-02-02 I              Blue Cross        NA          -1.88 
##  6 2020-02-09 I              Blue Cross        NA          -1.42 
##  7 2020-02-16 I              Blue Cross        NA          -1.42 
##  8 2020-02-23 I              Blue Cross        NA          -2.14 
##  9 2020-03-01 I              Blue Cross        NA          -0.860
## 10 2020-03-08 I              Blue Cross        NA          -1.42 
## # ... with 46 more rows, and 4 more variables: lag_value_lag14_roll_7 <dbl>,
## #   lag_value_lag14_roll_14 <dbl>, lag_value_lag14_roll_28 <dbl>,
## #   lag_value_lag14_roll_52 <dbl>

Train / Test Split

Split into Train / Test Sets

  • Start with data_prepared_tbl
  • Use time_series_split() to create a single time series split.
    • Set assess = horizon to get the last 14 weeks of data as testing data.
    • Set cumulative = TRUE to use all of the previous data as training data.
  • Save the object as splits
splits <- data_prepared_tbl %>% 
    time_series_split(assess = horizon, cumulative = TRUE)
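The split can be inspected visually with timetk's cross-validation plan helpers to confirm the last 14 weeks landed in the test set:

```r
# Convert the split to a CV plan tibble and plot train vs. test regions
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dsch_date, value)
```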

Feature Engineering

Create a Preprocessing recipe

Make a preprocessing recipe using recipe(). Note - It may help to prep() and juice() your recipe to see the effect of your transformations.

  • Start with recipe() using “value ~ .” and data = training(splits)
  • Add the following steps:
    • step_timeseries_signature() using the date feature
    • Remove any newly created features that:
      • Contain “iso”
      • Contain “xts”
      • Contain “hour”, “minute”, “second” or “am.pm” (because this is a weekly dataset and sub-daily features won’t add any predictive value)
    • Normalize the continuous calendar features (“index.num”, “year”, “yday”) with step_normalize().
    • Dummy all categorical features with step_dummy(). Set one_hot = TRUE.
    • Add a fourier series at periods 7, 14, and 52. Set K = 2 for each.
recipe_spec_base <- recipe(
  value ~ .
  , data = training(splits) %>%
    arrange(
      lag_ip_op_flag
      , payer_grouping
      , dsch_date)
  ) %>%
    
    # Time Series Signature
    step_timeseries_signature(dsch_date) %>%
    step_rm(matches("(iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
    
    # Standardization
    step_normalize(matches("(index.num)|(year)|(yday)")) %>%
    
    # Dummy Encoding (One Hot Encoding)
    step_dummy(all_nominal(), one_hot = TRUE) %>%
    
    # Fourier Series - 7, 14 & 52 week periods
    step_fourier(dsch_date, period = c(7, 14, 52), K = 2)

recipe_spec_base %>% 
  prep() %>% 
  juice() %>% 
  glimpse()
## Rows: 1,612
## Columns: 58
## $ dsch_date                 <date> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01~
## $ lag_value_lag14           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_7    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_14   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_28   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_52   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ value                     <dbl> 1.4201865, 0.9389710, 1.3456669, 1.0253210, ~
## $ dsch_date_index.num       <dbl> -1.725001, -1.716430, -1.707859, -1.699288, ~
## $ dsch_date_year            <dbl> -1.505993, -1.505993, -1.505993, -1.505993, ~
## $ dsch_date_half            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_quarter         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,~
## $ dsch_date_month           <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,~
## $ dsch_date_day             <int> 1, 8, 15, 22, 29, 5, 12, 19, 26, 4, 11, 18, ~
## $ dsch_date_wday            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_mday            <int> 1, 8, 15, 22, 29, 5, 12, 19, 26, 4, 11, 18, ~
## $ dsch_date_qday            <int> 1, 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78~
## $ dsch_date_yday            <dbl> -1.70440843, -1.63725147, -1.57009451, -1.50~
## $ dsch_date_mweek           <int> 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1,~
## $ dsch_date_week            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1~
## $ dsch_date_week2           <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,~
## $ dsch_date_week3           <int> 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0,~
## $ dsch_date_week4           <int> 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,~
## $ dsch_date_mday7           <int> 1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,~
## $ lag_ip_op_flag_I          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ lag_ip_op_flag_O          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ payer_grouping_Blue.Cross <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ payer_grouping_Commercial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_01    <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_02    <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_03    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,~
## $ dsch_date_month.lbl_04    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,~
## $ dsch_date_month.lbl_05    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_06    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_07    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_08    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_09    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_10    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_11    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_12    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_1      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_wday.lbl_2      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_3      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_4      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_5      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_6      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_7      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_sin7_K1         <dbl> 0.37526700, 0.95866785, 0.82017225, 0.064070~
## $ dsch_date_cos7_K1         <dbl> 0.9269168, 0.2845276, -0.5721167, -0.9979454~
## $ dsch_date_sin7_K2         <dbl> 0.6956826, 0.5455349, -0.9384684, -0.1278772~
## $ dsch_date_cos7_K2         <dbl> 0.71834935, -0.83808810, -0.34536505, 0.9917~
## $ dsch_date_sin14_K1        <dbl> -0.1911586, -0.5981105, -0.8865993, -0.99948~
## $ dsch_date_cos14_K1        <dbl> -0.98155916, -0.80141362, -0.46253829, -0.03~
## $ dsch_date_sin14_K2        <dbl> 0.37526700, 0.95866785, 0.82017225, 0.064070~
## $ dsch_date_cos14_K2        <dbl> 0.9269168, 0.2845276, -0.5721167, -0.9979454~
## $ dsch_date_sin52_K1        <dbl> 0.78183148, 0.85128444, 0.90832376, 0.952117~
## $ dsch_date_cos52_K1        <dbl> 0.62348980, 0.52470449, 0.41826780, 0.305731~
## $ dsch_date_sin52_K2        <dbl> 0.9749279, 0.8933455, 0.7598452, 0.5821853, ~
## $ dsch_date_cos52_K2        <dbl> -0.22252093, -0.44937040, -0.65010409, -0.81~
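With the base recipe and splits in place, the next step in the modeltime workflow is fitting models on the training data. A minimal sketch (not part of this report's output) using a plain linear regression; dropping dsch_date from the predictors is an assumption here, since lm() cannot use a raw date column and the signature features already carry that information:

```r
# Fit a simple lm workflow on the training split as a baseline
workflow_fit_lm <- workflow() %>%
  add_recipe(recipe_spec_base %>% step_rm(dsch_date)) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(training(splits))
```

Rows whose lag/rolling features are still NA are dropped by lm()'s default na.action during fitting.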